knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE,
  autodep = TRUE,
  cache.comments = FALSE
)

options(scipen = 999)

cbpalette <- multi.utils::cbpalette()

library(tidyverse)

First simulated example

We start by importing the data generated in the simple version of Simulating metagenomic time series data.

df <- readRDS("depends/sample0.rds") %>%
  as_tibble()

A full description of the data is provided in the linked notebook, but to give a quick idea: the plot below shows each draw from the two regimes, baseline and exponential, drawn as faint lines, together with each regime's 90% interval (shaded region) and median (bold line).

sample_summary <- df %>%
  group_by(day, regime) %>%
  summarise(
    count_upper = quantile(count, 0.95),
    count_median = median(count),
    count_lower = quantile(count, 0.05)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'day'. You can override using the `.groups` argument.
ggplot(sample_summary, aes(x = day, ymin = count_lower, y = count_median, ymax = count_upper, group = regime)) +
    geom_ribbon(alpha = 0.1, aes(fill = regime)) +
    geom_line(aes(col = regime), size = 1.5) +
    geom_line(data = df, aes(x = day, y = count, col = regime, group = id),
               alpha = 0.025, inherit.aes = FALSE) + 
    theme_minimal() +
    scale_color_manual(values = cbpalette) +
    scale_fill_manual(values = cbpalette) +
    guides(fill = "none") +
    labs(x = "Day", y = "Number of reads in sample", col = "Regime")

Performing inference

We fit a Poisson regression model (see Explaining Poisson regression or Estimating exponential growth via sporadic detection) to each \(k\)-mer in the data. It is important to note that the generative process used to create the simulated data may be more complex than the Poisson regression inferential model. We take this approach because we expect any functional NAO system to need to handle a large number of \(k\)-mers (or \(k\)-mer equivalence classes, etc.), and therefore we require the inferential approach to be highly scalable. Poisson regression is a comparatively simple inferential approach, which we hope can be shown via benchmarking (see Benchmarking Poisson regression for exponential growth detection) to scale to the number of \(k\)-mers we expect, which might be¹ on the order of \(10^9\). If Poisson regression is computationally feasible but shows unsatisfactory statistical behaviour in particular ways, then we could consider a more sophisticated inferential model to overcome those specific problems.

#' Fit to each unique id
#' Note that (confusingly) kmer is not unique, and represents the kmer within a given organism
fits <- tibble(id = unique(df$id)) %>%
  mutate(
    fit = map(id, ~glm(count ~ 1 + day, family = "poisson", data = filter(df, id == .))),
    fit = map(fit, broom::tidy, conf.int = TRUE)
  ) %>%
  unnest(fit)

#' Add truth column
#' Use distinct() so the join is one row per id; df has one row per id and day,
#' and joining on the raw data would duplicate each fit across days
fits <- fits %>%
  left_join(
    distinct(df, id, regime),
    by = "id"
  )

Now we can look at the inference for the slope parameter \(\beta\) in each regression, first the point estimate with confidence interval, then the \(p\)-value for \(\beta \neq 0\). The truth – either exponential or baseline regime – is shown by the point colour.

fits_day <- filter(fits, term == "day")

fits_day %>%
  ggplot(aes(x = id, y = estimate, ymin = conf.low, ymax = conf.high, col = regime)) +
    geom_pointrange(alpha = 0.05, size = 0.5) +
    theme_minimal() +
    scale_color_manual(values = cbpalette) +
    labs(x = "k-mer ID", y = "Estimated slope", col = "True regime") +
    theme(
      legend.position = "bottom"
    )

fits_day %>%
  ggplot(aes(x = id, y = p.value, col = regime)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values = cbpalette) +
  labs(x = "k-mer ID", y = "p-value for non-zero slope", col = "True regime") +
  theme(
    legend.position = "bottom"
  )

saveRDS(fits, "fits-sample0.rds")

Performance assessment

Suppose we classify as exponential growth when the 95% confidence interval for \(\beta\) is above zero. We can see which \(k\)-mers we misclassify by comparing this estimated regime to the true regime.

fits_day <- fits_day %>%
  mutate(
    est_regime = if_else(conf.low > 0, "Exponential", "Baseline"),
    correct = regime == est_regime
  )

fits_day %>%
  ggplot(aes(x = id, y = estimate, ymin = conf.low, ymax = conf.high, col = correct)) +
    geom_pointrange(alpha = 0.05, size = 0.5) +
    theme_minimal() +
    scale_color_manual(values = c("#D22B2B", "grey")) +
    labs(x = "k-mer ID", y = "Estimated slope", col = "Correctly classified?") +
    theme(
      legend.position = "bottom"
    )

fits_day %>%
  ggplot(aes(x = id, y = p.value, col = correct)) +
  geom_point() +
  theme_minimal() +
    scale_color_manual(values = c("#D22B2B", "grey")) +
  labs(x = "k-mer ID", y = "p-value for non-zero slope", col = "Correctly classified?") +
  theme(
    legend.position = "bottom"
  )

All of the \(k\)-mers which are exponentially increasing are classified as such. In other words, the false negative rate is zero. There are a small number of false positives: \(k\)-mers in the baseline regime which are classified as exponentially increasing.

fits_day <- fits_day %>%
  mutate(
    cm = case_when(
      regime == "Exponential" & est_regime == "Exponential" ~ "True positive",
      regime == "Exponential" & est_regime == "Baseline" ~ "False negative",
      regime == "Baseline" & est_regime == "Exponential" ~ "False positive",
      regime == "Baseline" & est_regime == "Baseline" ~ "True negative"
    )
  )

fits_day %>%
  janitor::tabyl(cm) %>%
  knitr::kable()
| cm             |    n |   percent |
|:---------------|-----:|----------:|
| False positive |   70 | 0.0083333 |
| True negative  | 7490 | 0.8916667 |
| True positive  |  840 | 0.1000000 |

We can highlight these few false positives in a plot.

df %>%
  left_join(
    select(fits_day, id, correct, cm),
    by = "id"
  ) %>%
  ggplot(aes(x = day, y = count, col = correct, alpha = correct, group = id)) +
    geom_line() + 
    theme_minimal() +
    scale_alpha_manual(values = c(1, 0.05)) +
    scale_color_manual(values = c("#D22B2B", "grey")) +
    guides(alpha = "none") + 
    labs(x = "Day", y = "Number of reads in sample", col = "Correctly classified?")

TODO: Compute classification probabilities.

The Brier score (see Evaluating exponential growth detection) is given by \[ \text{BS} = \frac{1}{n} \sum_{i = 1}^n \left(f_i - o_i \right)^2 \] where \(f_i\) is \(\mathbb{P} \left[ \text{regime}(i) = 1 \right]\) and \(o_i = \text{regime}(i) \in \{0, 1\}\).
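The TODO above could be sketched as follows, taking (as an illustrative assumption, not a settled choice) \(f_i = \Phi(\hat\beta_i / \mathrm{se}(\hat\beta_i))\), the approximate probability of a positive slope from the Wald statistic, as the classification probability:

```r
#' Sketch only: the choice of f_i here is an assumption for illustration
brier <- fits_day %>%
  mutate(
    f = pnorm(estimate / std.error),          # approximate P[beta > 0]
    o = as.numeric(regime == "Exponential")   # observed regime as 0/1
  ) %>%
  summarise(bs = mean((f - o)^2))
```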

Benchmarking

We can benchmark how long the Poisson regression model takes to fit using the function bench::mark:

bm <- bench::mark(
  glm(count ~ 1 + day, family = "poisson", data = filter(df, id == 1))
)

data_subset <- filter(df, id == 1)

bm2 <- bench::mark(
  glm(count ~ 1 + day, family = "poisson", data = data_subset)
) %>%
  print()
## # A tibble: 1 × 13
##   expression                                                     min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
##   <bch:expr>                                                   <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
## 1 glm(count ~ 1 + day, family = "poisson", data = data_subset) 749µs  831µs     1061.    4.86KB     6.48   491     3      463ms <glm>  <Rprofmem> <bench_tm> <tibble>
bm
## # A tibble: 1 × 6
##   expression                                                                min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>                                                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 glm(count ~ 1 + day, family = "poisson", data = filter(df, id == 1))   6.36ms   6.87ms      142.     274KB     9.02
bm2
## # A tibble: 1 × 6
##   expression                                                        min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>                                                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 glm(count ~ 1 + day, family = "poisson", data = data_subset)    749µs    831µs     1061.    4.86KB     6.48
saveRDS(bm, "benchmark-sample0.rds")
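As a rough back-of-the-envelope extrapolation from the median timing above (assuming purely serial fits with no shared overhead amortised away):

```r
#' Sketch: naive serial extrapolation from the median fit time to 10^9 k-mers
median_fit <- 831e-6            # seconds, median from bm2 above
total_sec <- median_fit * 1e9   # ~8.3e5 seconds serially
total_sec / 86400               # ~9.6 days; parallelism would divide this down
```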

With additional true abundance noise

df <- readRDS("depends/sample1.rds") %>%
  as_tibble()

Real data

counts <- read.table("data/counts-1pct-sample.tsv", header = TRUE)

summary(counts)
##        ec              day0               day1                day2                day3                day4                day5               day6          
##  Min.   :     0   Min.   :    0.00   Min.   :    0.000   Min.   :    0.000   Min.   :    0.000   Min.   :    0.000   Min.   :    0.00   Min.   :    0.000  
##  1st Qu.:250000   1st Qu.:    1.00   1st Qu.:    1.000   1st Qu.:    1.000   1st Qu.:    1.000   1st Qu.:    1.000   1st Qu.:    1.00   1st Qu.:    1.000  
##  Median :500000   Median :    2.00   Median :    3.000   Median :    2.000   Median :    2.000   Median :    2.000   Median :    2.00   Median :    2.000  
##  Mean   :500000   Mean   :    2.97   Mean   :    4.143   Mean   :    4.052   Mean   :    3.701   Mean   :    3.951   Mean   :    3.25   Mean   :    3.284  
##  3rd Qu.:749999   3rd Qu.:    4.00   3rd Qu.:    5.000   3rd Qu.:    5.000   3rd Qu.:    5.000   3rd Qu.:    5.000   3rd Qu.:    4.00   3rd Qu.:    4.000  
##  Max.   :999999   Max.   :18166.00   Max.   :15648.000   Max.   :15726.000   Max.   :13494.000   Max.   :19843.000   Max.   :17471.00   Max.   :13571.000  
##       day7                day8                day9               day10               day11               day12               day13               day14         
##  Min.   :    0.000   Min.   :    0.000   Min.   :    0.000   Min.   :    0.000   Min.   :    0.000   Min.   :    0.000   Min.   :    0.000   Min.   :    0.00  
##  1st Qu.:    1.000   1st Qu.:    1.000   1st Qu.:    1.000   1st Qu.:    1.000   1st Qu.:    1.000   1st Qu.:    2.000   1st Qu.:    1.000   1st Qu.:    1.00  
##  Median :    2.000   Median :    2.000   Median :    2.000   Median :    3.000   Median :    2.000   Median :    3.000   Median :    2.000   Median :    2.00  
##  Mean   :    3.803   Mean   :    3.908   Mean   :    3.199   Mean   :    4.293   Mean   :    3.472   Mean   :    5.277   Mean   :    3.411   Mean   :    2.88  
##  3rd Qu.:    4.000   3rd Qu.:    5.000   3rd Qu.:    4.000   3rd Qu.:    5.000   3rd Qu.:    4.000   3rd Qu.:    6.000   3rd Qu.:    4.000   3rd Qu.:    3.00  
##  Max.   :17873.000   Max.   :18258.000   Max.   :15092.000   Max.   :15555.000   Max.   :12785.000   Max.   :21143.000   Max.   :15335.000   Max.   :11376.00  
##      day15               day16         
##  Min.   :    0.000   Min.   :   0.000  
##  1st Qu.:    1.000   1st Qu.:   0.000  
##  Median :    3.000   Median :   1.000  
##  Mean   :    4.766   Mean   :   2.266  
##  3rd Qu.:    6.000   3rd Qu.:   3.000  
##  Max.   :22879.000   Max.   :8451.000
#' There are this many k-mers in the data
nrow(counts)
## [1] 1000000
#' These are the column names for counts
names(counts)
##  [1] "ec"    "day0"  "day1"  "day2"  "day3"  "day4"  "day5"  "day6"  "day7"  "day8"  "day9"  "day10" "day11" "day12" "day13" "day14" "day15" "day16"
#' This is what the top part of the data looks like
head(counts)
counts_subset <- filter(counts, ec < 1000)

#' In a tidy format for plotting and fitting
counts_subset_long <- counts_subset %>%
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    names_prefix = "day",
    values_to = "count"
  ) %>%
  mutate(day = as.integer(day))

#' Plot the time series
ggplot(counts_subset_long, aes(x = day, y = count, group = ec)) +
  geom_line(alpha = 0.05) +
  theme_minimal() +
  labs(x = "Day", y = "Count")

#' There is also a file which I assume contains the total reads per day
row_sums <- read.table("data/daily-counts.tsv", header = TRUE)
row_sums
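If daily-counts.tsv does give the total reads per day, one natural use is as an exposure offset in the Poisson regression, so that the estimated slope reflects relative rather than absolute abundance. A sketch under that assumption (the column names `day` and `total` are guesses and need checking against the file):

```r
#' Sketch only: assumes row_sums has columns `day` and `total` giving
#' total reads per day (actual column names need checking)
counts_with_offset <- counts_subset_long %>%
  left_join(row_sums, by = "day")

fit_ec1 <- glm(
  count ~ 1 + day + offset(log(total)),
  family = "poisson",
  data = filter(counts_with_offset, ec == 1)
)
```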

Notes on feedback from Sam

Status: short response, some points already integrated into document.

Major comments

  • More detail on expected sampling process would help: I assume this refers to the metagenomic sampling process, which I agree is a hole at the moment and something I’m looking to fill in further. For now, I chose to use a multinomial sample with the number of reads as the sample size, as it seemed simplest.
  • Some example data would be very useful to understand the process: I’m unsure exactly what “example data” refers to here, would it be real data or more examples of simulated data? Since there are examples of simulated data in there, I’d assume real data. I agree that we need to start looking at more real data (so far I’ve only looked at one example of real data).

Minor comments

  • More context on expected processes happening in background: Agree that what has been done is quite abstract and not very linked to a particular scenario or case-study. Since I’ve been talking about making things realistic, picking one case-study to focus on, with realistic values and settings, could be a good route.
  • Gaussian random walk creates large changes: Agree, I’ve found that draws from Gaussian growth rate with mean greater than zero (exponential regime) can be quite varied. Worth thinking about more.
  • Adjustments for prevalence and not incidence: Good point, I don’t know. For now we’re assuming that there will be an exponential signal in the \(k\)-mers, which I agree is probably proportional to prevalence more than incidence. I think it’s an open question whether we need to adjust for this (something that more epidemiological modelling [Charlie / Janvi?] could potentially help with).

  1. To some extent, the number of \(k\)-mers can be varied in accordance with computational capacity.↩︎